ABSTRACT

We present a system and a set of techniques for learning
linear predictors with convex losses on terascale datasets,
with trillions of features,^1 billions of training examples and
millions of parameters in an hour using a cluster of 1000
machines. Individually none of the component techniques
is new, but the careful synthesis required to obtain an efficient
implementation is a novel contribution. The result is,
to the best of our knowledge, the most scalable and efficient linear
learning system reported in the literature. We describe and
thoroughly evaluate the components of the system, showing
the importance of the various design choices.

1. INTRODUCTION 

Distributed machine learning is a research area that has 
seen a growing body of literature in recent years. Much work 
focuses on problems of the form

    $$\min_{w \in \mathbb{R}^d} \; \sum_{i=1}^{n} \ell(w^\top x_i; y_i) + R(w) \qquad (1)$$

where $x_i$ is the feature vector of the $i$-th example, $y_i$ is the
label, $w$ is the linear predictor, $\ell$ is a loss function and $R$ a
regularizer. Much of this work exploits the natural decomposability
over examples in (1), partitioning the examples
over different nodes in a distributed environment such as a
cluster.

Perhaps the simplest learning strategy when the number
of samples n is very large is to subsample a smaller set of
examples on which learning is tractable. However, this
strategy only works if the problem is simple enough or the 
number of parameters is very small. The setting of interest 
here is when a large number of samples is really needed to 
learn a good model, and distributed algorithms are a natural 
choice for such scenarios. 



^1 The number of features here refers to the number of non-zero
entries in the data matrix.

Some prior works (McDonald et al., 2010; Zinkevich et al.,
2010) consider online learning with averaging, and Duchi
et al. (2010a) propose gossip-style message passing algorithms
extending the existing literature on distributed convex
optimization (Bertsekas and Tsitsiklis, 1989). Langford
et al. (2009) analyze a delayed version of distributed online
learning. Dekel et al. (2010) consider mini-batch versions of
online algorithms, which are extended to delay-based updates
in Agarwal and Duchi (2011). A recent article of Boyd et al.
(2011) describes an application of the ADMM technique for
distributed learning problems. GraphLab (Low et al., 2010)
is a parallel computation framework on graphs. More closely
related to our work is that of Teo et al. (2007), who use MPI
to parallelize a bundle method for optimization.

However, all of the aforementioned approaches seem to
leave something to be desired empirically when deployed on
large clusters. In particular, their throughput (measured as
the input size divided by the wall clock running time) is
smaller than the I/O interface of a single machine for
almost all parallel learning algorithms (Bekkerman et al.,
2011, Part III, page 8). The I/O interface is an upper bound
on the speed of the fastest sequential algorithm since all
sequential algorithms are limited by the network interface
in acquiring data. In contrast, we were able to achieve a
throughput of 500M features/s, which is about a factor of 5
faster than the 1Gb/s network interface of any one node.

An additional benefit of our system is its compatibility
with MapReduce clusters such as Hadoop (unlike MPI-based
systems) and minimal additional programming effort to
parallelize existing learning algorithms (unlike MapReduce
approaches).

One of the key components in our system is a communi- 
cation infrastructure that efficiently accumulates and broad- 
casts values across all nodes of a computation. It is function- 
ally similar to MPI AllReduce (hence we use the name), but 
it takes advantage of and is compatible with Hadoop so that 
programs are easily moved to data, automatic restarts on 
failure provide robustness, and speculative execution speeds 
completion. Our optimization algorithm is a hybrid on- 
line+batch algorithm with non-uniform parameter averag- 
ing. 

The paper is organized as follows. In Section 2 we discuss
the approach used and the communication infrastructure we
set up. Most of our effort is devoted to Section 3, where we
conduct many experiments comparing with existing algorithms
and various design choices within our own algorithm.
In Section 4 we discuss and contrast our approach with the
many approaches people have proposed for parallel learning.

Figure 1: AllReduce

2. COMPUTATION AND COMMUNICATION 
FRAMEWORK 

MapReduce (Dean and Ghemawat, 2008) and its open
source implementation Hadoop have become the overwhelmingly
favorite platforms for distributed data processing in
general. However, the abstraction is rather ill-suited for
machine learning algorithms, as several researchers in the
field have observed (Low et al., 2010; Zaharia et al., 2011),
because it does not easily allow iterative algorithms, such
as typical optimization algorithms used to solve the
problem (1).

2.1 Hadoop-compatible AllReduce 

AllReduce is a more suitable abstraction for machine learn- 
ing algorithms. AllReduce is an operation where every node 
starts with a number and ends up with the sum of the num- 
bers at all the nodes. A typical implementation is done by 
imposing a tree structure on the communicating nodes — 
numbers can be summed up the tree (this is the reduce 
phase) and then broadcast down to all nodes — hence the 
name AllReduce. See Figure 1 for a graphical illustration.
When doing summing or averaging of a long vector, such as
the weight vector $w$ in the optimization (1), the reduce and
broadcast operations can be pipelined over the vector entries,
hence the latency of going up and down the tree becomes
negligible on a typical Hadoop cluster.
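
As a rough illustration of the reduce and broadcast phases described above, the following Python sketch simulates AllReduce over a binary tree inside a single process. The tree layout (node i has children 2i+1 and 2i+2), the name `tree_allreduce` and the use of NumPy are assumptions made for this example; it does not model the networking or the pipelining of the actual implementation.

```python
import numpy as np

def tree_allreduce(values):
    """Simulate AllReduce (sum) over a binary tree of len(values) nodes.

    Hypothetical layout: node i has children 2*i + 1 and 2*i + 2.
    """
    m = len(values)
    partial = [np.asarray(v, dtype=float).copy() for v in values]

    # Reduce phase: visit nodes from the leaves towards the root, so each
    # node already holds the sum of its subtree when it is added to its parent.
    for i in range(m - 1, 0, -1):
        partial[(i - 1) // 2] += partial[i]

    # Broadcast phase: the root holds the global sum; every other node
    # copies the result from its parent.
    result = [None] * m
    result[0] = partial[0]
    for i in range(1, m):
        result[i] = result[(i - 1) // 2].copy()
    return result
```

Every node ends with the same vector, which is the property the learning algorithms rely on.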

For problems of the form (1), AllReduce provides straightforward
parallelization: we just accumulate local gradients
for a gradient based algorithm like gradient descent or
L-BFGS. In general, any statistical query algorithm (Kearns,
1993) can be parallelized with AllReduce with only a handful
of additional lines of code. This approach also easily implements
averaging parameters of online learning algorithms.
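
To make the gradient accumulation concrete, here is a minimal single-process sketch of one gradient descent step in which only the summed gradient crosses node boundaries. The logistic loss, the helper `allreduce_sum` (a stand-in for a real AllReduce call) and the shard representation are assumptions chosen for illustration, not the system's actual code.

```python
import numpy as np

def allreduce_sum(local_vectors):
    """Stand-in for AllReduce: every node ends up with the elementwise sum."""
    total = np.sum(local_vectors, axis=0)
    return [total.copy() for _ in local_vectors]

def local_gradient(w, X, y):
    """Gradient of sum_i log(1 + exp(-y_i * w.x_i)) on one node's shard."""
    margins = y * (X @ w)
    return -(X.T @ (y / (1.0 + np.exp(margins))))

def gradient_descent_step(w, shards, step_size):
    """One batch gradient step over data split into per-node shards."""
    local = [local_gradient(w, X, y) for X, y in shards]
    g = allreduce_sum(local)[0]  # identical on every node after AllReduce
    return w - step_size * g
```

Because the loss is a sum over examples, the result matches a step computed on the undivided data.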

An implementation of AllReduce is available in the MPI
package. However, it is not easy to run MPI on top of existing
Hadoop clusters (Ye et al., 2009). Moreover, MPI implements
little fault tolerance, with the bulk of robustness
left to the programmer.

To address the reliability issues better, we developed an
implementation of AllReduce that is compatible with Hadoop.
Implementation of AllReduce using a single tree is clearly
less desirable than MapReduce in terms of reliability, because
if any individual node fails, the entire computation
fails. To deal with this, we use a simple trick, described below,
which makes AllReduce reliable enough to use in practice for
computations up to 10K node hours.

2.2 Proposed Algorithm 

Our main algorithm is a hybrid online+batch approach.
We start with each node making one online pass over its local
data according to adaptive gradient updates (Duchi et al.,
2010b; McMahan and Streeter, 2010) modified for loss
nonlinearity (Karampatziakis and Langford, 2011). AllReduce
is used to average these weights non-uniformly using the
local gradients. Concretely, node $k$ maintains a local weight
vector $w^k$ and a diagonal matrix $G^k$ based on the gradients
in the adaptive gradient updates (see Algorithm 1). We
compute the following weighted average over all $m$ nodes:

    $$\bar{w} = \Big(\sum_{k=1}^{m} G^k\Big)^{-1} \Big(\sum_{k=1}^{m} G^k w^k\Big) \qquad (2)$$

This has the effect of weighing each dimension according to
how "confident" each node is in its weight (i.e., more weight
is assigned to a given parameter of a given node if that
node has seen more examples with the corresponding feature).
We note that this averaging can indeed be implemented
using AllReduce by two calls to the routine since
the $G^k$ are only diagonal. This solution $\bar{w}$ is used to initialize
L-BFGS (Nocedal, 1980) with the standard Jacobi preconditioner.
At each iteration, the local gradients are summed up
using AllReduce, while all the other operations can be done
locally at each node. The algorithm benefits from the fast
initial reduction of error that an online algorithm provides,
and from the rapid convergence in a good neighborhood guaranteed
by quasi-Newton algorithms.
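
The non-uniform averaging can be sketched in a few lines once the diagonal matrices are stored as vectors: each coordinate of the result is the mean of the per-node weights, weighted by the corresponding diagonal entries. The function name `weighted_average` and the list-of-arrays representation are assumptions for this example; the two `np.sum` calls stand in for the two AllReduce calls mentioned in the text.

```python
import numpy as np

def weighted_average(local_w, local_G):
    """Combine per-node weights w^k using diagonal matrices G^k (stored as vectors)."""
    # First AllReduce call: sum over nodes of G^k w^k (elementwise product).
    numerator = np.sum([G * w for w, G in zip(local_w, local_G)], axis=0)
    # Second AllReduce call: sum over nodes of G^k.
    denominator = np.sum(local_G, axis=0)
    # Inverting a diagonal matrix reduces to an elementwise division.
    return numerator / denominator
```

A node whose diagonal entry for a feature is large (it accumulated more gradient mass on that feature) pulls the averaged coordinate towards its own value.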

Another strategy we evaluate is that of repeated online
learning with averaging using the adaptive updates. In this
setting, each node performs an online pass over its data and
then we average the weights according to Equation (2). We
average the scaling matrices similarly,

    $$\bar{G} = \Big(\sum_{k=1}^{m} G^k\Big)^{-1} \Big(\sum_{k=1}^{m} (G^k)^2\Big)$$

and use this averaged state to start a new online pass over
the data. We will see in the next section that this strategy
can be very effective at getting a moderately small test error
very fast, but might not be able to get a very small test error.

Note that our implementation is open source in Vowpal
Wabbit (Langford et al., 2007) and is summarized in
Algorithm 2. It makes use of the stochastic gradient descent
(Algorithm 1) for the initial pass.

2.3 Speculative Execution 

It is common for large clusters of machines to be busy with 
many jobs which use the cluster in an uneven way, commonly 
resulting in one of a thousand nodes being very slow. To 
avoid this, Hadoop can speculatively execute a job on iden- 
tical data, using the first job to finish and killing the other 
one. In our framework, it can be tricky to handle duplicates 
once a spanning tree topology is created for AllReduce. For 
this reason, we delay the initialization of the spanning tree 
until each node completes a pass over the data, building the 
spanning tree on only the speculative execution survivors. 
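
One way to picture the selection of survivors is the sketch below, which keeps, for each data split, the attempt that finished its first pass earliest; only those attempts would then join the spanning tree. The tuple layout and the name `select_survivors` are hypothetical and are not taken from Hadoop or from the actual implementation.

```python
def select_survivors(attempts):
    """Keep, for each data split, the attempt that finished its pass first.

    attempts: iterable of (split_id, attempt_id, finish_time) tuples.
    Returns {split_id: attempt_id} for the nodes that join the spanning tree.
    """
    best = {}
    for split_id, attempt_id, finish_time in attempts:
        if split_id not in best or finish_time < best[split_id][1]:
            best[split_id] = (attempt_id, finish_time)
    return {split: attempt for split, (attempt, _) in best.items()}
```

Delaying tree construction until this choice is made means that no duplicate ever participates in AllReduce.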



Algorithm 1 Stochastic gradient descent algorithm on a
single node using adaptive gradient update (Duchi et al.,
2010b; McMahan and Streeter, 2010).

Require: Invariance update function $s$
         (see Karampatziakis and Langford, 2011)
$w = 0$, $G = I$
for all $(x, y)$ in training set do
    $g \leftarrow \nabla_w \ell(w^\top x; y)$
    $w \leftarrow w - s(w, x, y)\, G^{-1/2} g$
    $G_{jj} \leftarrow G_{jj} + g_j^2$ for all $j = 1, \ldots, d$
end for
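
A runnable approximation of Algorithm 1 is sketched below. It makes two simplifying assumptions that are not part of the original: the loss is the logistic loss with labels in {-1, +1}, and the invariance update function s(w, x, y) is replaced by a constant step size `eta`. The diagonal matrix G is stored as a vector.

```python
import numpy as np

def logistic_gradient(w, x, y):
    """Gradient of log(1 + exp(-y * w.x)) with respect to w, for y in {-1, +1}."""
    return -y * x / (1.0 + np.exp(y * (w @ x)))

def adaptive_sgd(X, y, eta=0.5):
    """One online pass with per-coordinate adaptive step sizes.

    eta is a constant stand-in for the invariance update function s(w, x, y).
    """
    d = X.shape[1]
    w = np.zeros(d)
    G = np.ones(d)                     # diagonal of G, initialised to the identity
    for x_i, y_i in zip(X, y):
        g = logistic_gradient(w, x_i, y_i)
        w = w - eta * g / np.sqrt(G)   # w <- w - s * G^{-1/2} g
        G = G + g ** 2                 # G_jj <- G_jj + g_j^2
    return w, G
```

The returned pair (w, G) is exactly the per-node state that the weighted averaging step consumes.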



Algorithm 2 Sketch of the proposed learning architecture
Require: Data split across nodes
for all nodes $k$ do
    $w^k$ = result of stochastic gradient descent
    on the data of node $k$ using Algorithm 1
end for
Compute the weighted average $\bar{w}$ as in (2)
using AllReduce.
Start a preconditioned L-BFGS optimization from $\bar{w}$.
for all nodes $k$ do
    for $t = 1, \ldots, T$ do
        Compute $g^k$, the (local batch) gradient
        of examples on node $k$.
        Compute $g = \sum_{k=1}^{m} g^k$ using AllReduce.
        Add the regularization part in the gradient.
        Take an L-BFGS step.
    end for
end for



The net effect of this speculative execution trick is perhaps 
another order of magnitude of scalability and reliability in 
practice. Indeed, we found the system reliable enough for 
up to 1000 nodes running failure-free for hundreds of trials. 
This significant gain from Hadoop's built-in fault tolerance 
highlights the benefits of a Hadoop-compatible implementation
of AllReduce. We will show the substantial gains from
speculative execution in mitigating the "slow node" problem 
in the experiments. 

3. EXPERIMENTS 
3.1 Datasets 

Display advertising. 

In online advertising, given a user visiting a publisher 
page, the problem is to select the best advertisement for 
that user. A key element in this matching problem is the 
click-through rate (CTR) estimation: what is the probabil- 
ity that a given ad will be clicked given some context (user, 
page visited)? Indeed, in a cost-per-click (CPC) campaign, 
the advertiser only pays when the ad gets clicked, so even 
modest improvements in predictive accuracy directly affect
revenue.

There are several features representing the user, page, ad,
as well as conjunctions of these features. Some of the features
include identifiers of the ad, advertiser, publisher and
visited page. These features are hashed (Weinberger et al.,
2009) and each training sample ends up being represented
as a sparse binary vector of dimension $2^{24}$ with around 100
non-zero elements. Let us illustrate the construction of a
conjunction feature with an example. Imagine that an ad
from etrade was placed on finance.yahoo.com. Let $h$ be
a 24 bit hash of the string "publisher=finance.yahoo.com
and advertiser=etrade". Then the (publisher, advertiser)
conjunction is encoded by setting to 1 the $h$-th dimension of
the feature vector for that example.
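
The construction above can be sketched as follows. The text does not say which hash function is used, so MD5 truncated to 24 bits is purely an assumption for illustration, as are the helper names `hashed_index` and `encode`.

```python
import hashlib

NUM_BITS = 24  # the text uses a 24 bit hash

def hashed_index(feature, num_bits=NUM_BITS):
    """Map a feature string to an index in [0, 2**num_bits)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % (1 << num_bits)

def encode(features, num_bits=NUM_BITS):
    """Return the sorted indices of the non-zero entries of the binary vector."""
    return sorted({hashed_index(f, num_bits) for f in features})

# The (publisher, advertiser) conjunction from the example in the text.
h = hashed_index("publisher=finance.yahoo.com and advertiser=etrade")
```

Only the indices of the non-zero entries need to be stored, which is what keeps a vector of dimension 2^24 with around 100 active features cheap to represent.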

Since the data is unbalanced (low CTR) and because of
the large number of samples, we subsampled the negative examples,
resulting in a class ratio of about 2 negatives for 1
positive, and used a large test set drawn from days later than
the training set. There are 2.3B samples in the training set.

Splice Site Recognition. 

The problem consists of recognizing a human acceptor
splice site (Sonnenburg and Franc, 2010). We considered
this learning task because this is, as far as we know, the largest
public dataset for which subsampling is not an effective learning
strategy. Sonnenburg et al. (2007) introduced the weighted
degree kernel to learn over DNA sequences. They also proposed
an SVM training algorithm for that kernel; learning
over 10M sequences took 24 days. In Sonnenburg and
Franc (2010), an improved training algorithm is proposed in
which the weight vector (in the feature space induced by
the kernel) is learned, but the feature vectors are never explicitly
computed. This resulted in faster training: 3 days
with 50M sequences.

We follow the same experimental protocol as in Sonnenburg
and Franc (2010): we use the same training and test
sets of respectively 50M and 4.6M samples. We also consider
the same kernel of degree $d = 20$ and hash size $\gamma = 12$.
The feature space induced by this kernel has dimensionality
11,725,480. The number of non-zero features per sequence
is about 3,300. Unlike Sonnenburg and Franc (2010), we
explicitly compute the feature space representation of the
samples, yielding about 3TB of data. This explicit representation
is a disadvantage we imposed on our method, purely
as a matter of implementation time.

3.2 Results 

Effect of subsampling. 

The easiest way to deal with a very large training set is
to subsample it as discussed in the introduction. Sometimes
similar test errors can be achieved with smaller training sets
and there is no need for large scale learning in these cases.
For splice site recognition, Table 2 of Sonnenburg and Franc
(2010) shows that smaller training sets do hurt the area
under the precision/recall curve on the test set.

For display advertising, we subsampled the data at 1%
and 10%. The results in Table 1 show that there is a noticeable
drop in accuracy after subsampling. Note that even if
the drop does not appear large at first sight, it can cause a
substantial loss of revenue. Thus, for both datasets, the entire
training data is needed to achieve optimal performance.

The three metrics reported in Table 1 are area under the
ROC curve (auROC), area under the precision/recall curve
(auPRC) and negative log-likelihood (NLL). Since auPRC
is the most sensitive metric, we report test results using that
metric in the rest of the paper. This is also the metric used
in Sonnenburg and Franc (2010).



Running time. 



Table 1: Test performance on the display advertising
problem as a function of the subsampling rate.

          1%       10%      100%
auROC     0.8178   0.8301   0.8344
auPRC     0.4505   0.4753   0.4856
NLL       0.2654   0.2582   0.2554



Table 2: Distribution of computing time (in seconds)
over 1000 nodes. First three columns are quantiles.
Times are average per iteration (excluding the first
one) for the splice site recognition problem. The
first row is without speculative execution while the
second row is with speculative execution.

           5%    50%   95%   Max   Comm. time
Without    29    34    60    758   26
With       29    33    49    63    10



We ran 5 iterations of L-BFGS on the splice site data 
with 1000 nodes. On each node, we recorded for every itera- 
tion the time spent in AllReduce and the computing time — 
defined as the time not spent in AllReduce. The time spent 
in AllReduce can further be divided into stall time — waiting 
for the other nodes to finish their computation — and com- 
munication time. The communication time can be estimated 
by taking the minimum value of the AllReduce times across 
nodes. 
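
The decomposition described above amounts to the following small computation. The function name and the sample timings in the usage note are hypothetical; only the rule (communication time is the minimum AllReduce time across nodes, the remainder is stall time) comes from the text.

```python
def split_allreduce_time(allreduce_times):
    """Split per-node AllReduce times (seconds) into communication and stall.

    The node that waited least approximates pure communication time;
    whatever each node spent beyond that is attributed to stalling.
    """
    comm = min(allreduce_times)
    stall = [t - comm for t in allreduce_times]
    return comm, stall
```

For example, `split_allreduce_time([12.0, 10.0, 45.0])` attributes 10 seconds to communication and stall times of 2, 0 and 35 seconds to the three nodes.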

The distribution of the computing times is of particular 
interest because the speed of our algorithm depends on the 
slowest node. Statistics are shown in Table 2. It appears
that most computing times are concentrated around the median,
but there are a few outliers. Without speculative execution,
one single node was about 10 times slower than
the other nodes; this has the catastrophic consequence of
slowing down the entire process by a factor of 10. The use of
speculative execution successfully mitigated this issue.

Finally, we study the running time as a function of the 
number of nodes. For the display advertising problem, we 
varied the number of nodes from 10 to 100 and computed 
the speed-up factor relative to the run with 10 nodes. In 
each case, we measured the amount of time needed to get to 
a fixed test error. Since there can be significant variations 
from one run to the other — mostly because of the cluster 
utilization — each run was repeated 10 times. Results are 
reported in Figure 2. We note that speculative execution
was not turned on in this experiment, and we expect better 
speedups with speculative execution. 

Table 3 shows the running times for attaining a fixed test
error as a function of the number of nodes on the splice site
recognition problem. Unlike Figure 2, these timing results
have not been repeated and there is thus a relatively large
uncertainty on their expected values. It can be seen from
Tables 2 and 3 that even with as many as 1000 nodes,
communication is not the bottleneck. One of the main challenges
instead is the "slow node" issue. This is mitigated to some
degree by speculative execution, but as the number of nodes
increases, so does the likelihood of hitting slow nodes.
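
To put the wall clock row of Table 3 in perspective, the sketch below computes the speed-up relative to the 100-node run and the fraction of the ideal (linear) speed-up that it represents. The helper name `speedups` is an assumption; the numbers are copied from Table 3 and, as noted above, come from single runs.

```python
nodes = [100, 200, 500, 1000]
wall_clock = [3677, 2120, 938, 813]  # "Wall clock time" row of Table 3

def speedups(nodes, times):
    """Return (nodes, speed-up vs. first entry, fraction of ideal speed-up)."""
    base_n, base_t = nodes[0], times[0]
    rows = []
    for n, t in zip(nodes, times):
        speedup = base_t / t
        ideal = n / base_n
        rows.append((n, speedup, speedup / ideal))
    return rows

rows = speedups(nodes, wall_clock)
```

Going from 100 to 1000 nodes yields a speed-up of roughly 4.5 rather than the ideal 10, which is consistent with the slow node effect rather than communication being the limiting factor.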

Finally, we experimented with an 8 times larger version
of the display advertising data (16B examples). Using 1000
nodes and 10 passes over the data, the training took only 70
minutes.^2

Figure 2: Speed-up for obtaining a fixed test error,
on the display advertising problem, relative to
the run with 10 nodes, as a function of the number
of nodes. The dashed line corresponds to the ideal
speed-up, the solid line is the average speed-up over
10 repetitions and the bars indicate maximum and
minimal values.

Table 3: Computing time on the splice site recognition
data with various numbers of nodes for obtaining
a fixed test error. The first 3 rows are averages per
iteration (excluding the first one).

Nodes                      100    200    500    1000
Comm time / pass           5      12     9      16
Median comp time / pass    167    105    43     34
Max comp time / pass       462    271    172    95
Wall clock time            3677   2120   938    813

Figure 3: Effect of initializing the L-BFGS optimization
by an average solution from online runs on
individual nodes.

Table 4: auPRC after one online pass followed by 5
L-BFGS iterations.

           No avg.   Unif. avg.   Weighted avg.
Display    0.4729    0.4815       0.4810
Splice     0.4188    0.3164       0.4996

Online and batch learning. 

We now investigate the number of iterations needed to 
reach a certain test performance for different learning strate- 
gies: batch, online and hybrid. 

First, Figure 3 compares two learning strategies (batch
with and without an initial online pass) on the training set.
It plots the optimality gap, defined as the difference between
the current objective function and the optimal one (i.e., the
minimum value of the objective (1)), as a function of the number
of iterations. From this figure, one can see that the initial
online pass results in a saving of about 10-15 iterations.

Figure 4 shows the test auPRC, on both datasets, as a
function of the number of iterations for 4 different strategies:
only online learning, only L-BFGS learning, and 2 hybrid
methods consisting of 1 or 5 passes of online learning
followed by L-BFGS optimization. L-BFGS with one online
pass appears to be the most effective strategy.

For the splice recognition problem, an initial online pass
and 14 L-BFGS iterations yield an auPRC of 0.581, which
is just a bit higher than Sonnenburg and Franc (2010). This
was achieved in 1960 seconds using 500 machines, resulting
in a speed-up factor of 68 (132581 seconds on a single machine
reported in Table 2 of Sonnenburg and Franc (2010)).
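
For reference, the quoted speed-up factor follows directly from the two running times given above:

```latex
\frac{132581\ \text{s}}{1960\ \text{s}} \approx 67.6 \approx 68
```

With 500 machines, this corresponds to roughly 14% of the ideal linear speed-up, keeping in mind that the single-machine figure comes from a different training algorithm.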



Averaging. 

Table 4 compares picking one online run at random, using
uniform weight averaging, or using non-uniform weight
averaging according to Equation (2) from adaptive updates.
Note that the random pick for splice was apparently lucky,
and that weighted averaging works consistently well.

3.3 Comparison with previous approaches 

AllReduce vs. MapReduce. 

The standard way of using MapReduce for iterative machine
learning algorithms is the following (Chu et al., 2007):
every iteration is a M/R job where the mappers compute
some local statistics (such as a gradient) and the reducers
sum them up. This is ineffective because each iteration has
large overheads (job scheduling, data transfer, data parsing,
etc.). We have an internal implementation of such a M/R
algorithm. We updated this code to use AllReduce instead and
compared both versions of the code in Table 5. This table
confirms that Hadoop MapReduce has substantial overheads
since the training time is not much affected by the dataset
size. The speedup factor of AllReduce over Hadoop MapReduce
can become extremely large for smaller datasets, and
remains noticeable even for the largest datasets.

^2 As mentioned before, there can be substantial variations in
timing between different runs; this one was done when the
cluster was not much occupied.

Table 5: Average training time per iteration of an
internal logistic regression implementation using either
MapReduce or AllReduce for gradient aggregation.
The dataset is the display advertising one
and a subset of it.

             Full size   10% sample
MapReduce    1690        1322
AllReduce    670         59
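
The speed-up factors implied by Table 5 can be read off with a short calculation. The dictionary layout and the helper name `speedup` are assumptions for this example; the numbers are copied from the table.

```python
# Average training time per iteration, copied from Table 5.
table5 = {
    "full size": {"MapReduce": 1690, "AllReduce": 670},
    "10% sample": {"MapReduce": 1322, "AllReduce": 59},
}

def speedup(row):
    """Ratio of MapReduce time to AllReduce time for one dataset size."""
    return row["MapReduce"] / row["AllReduce"]

speedups = {name: speedup(row) for name, row in table5.items()}

# How much the MapReduce time grows when the data grows tenfold.
mapreduce_growth = table5["full size"]["MapReduce"] / table5["10% sample"]["MapReduce"]
```

AllReduce is about 2.5 times faster on the full data and about 22 times faster on the 10% sample, while the MapReduce time grows by only about 1.3 times for ten times more data, which is the overhead effect described above.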

It is also noteworthy that all algorithms described in Chu
et al. (2007) can be parallelized with AllReduce, plus further
algorithms such as parameter averaging approaches.

Overcomplete average. 

We implemented oversampled stochastic gradient with final
averaging (Zinkevich et al., 2010), and compared its performance
to our algorithm. We used stochastic gradient descent
with the learning rate in the t-th iteration as

We tune $\gamma$ and $L$ on a small subset of the dataset.

In Figure 5 we see that the oversampled SGD is competitive
with our approach on the display advertising data set,
but its convergence is much slower on splice site recognition
data.

Parallel online mini-batch.

Dekel et al. (2010) propose to perform online convex optimization
using stochastic gradients accumulated in small
mini-batches across all nodes. We implemented an SGD version
of their algorithm using AllReduce. They suggest global
minibatch sizes of no more than $b \propto \sqrt{n}$. On $m$ nodes, each
node accumulates gradients from $b/m$ examples, then an
AllReduce operation is carried out, yielding the mini-batch
gradient, and each node performs a stochastic gradient update
with the learning rate of the form



We tuned $L$ and $\gamma$ on a smaller dataset. In Figure 5 we
report the results on the splice data set, using 500 nodes and
mini-batch size $b = 100$k. Twenty passes over the data thus
corresponded to 10k updates. Due to the overwhelming
communication overhead associated with the updates, the overall
running time was 40 hours. In contrast, L-BFGS took
less than an hour to finish 20 passes, and obtained a much
superior performance. The difference in the running time
between 1h and 40h is solely due to communication. Thus,
in this instance, we can conservatively conclude that the
communication overhead of 10k mini-batch updates is 39
hours.
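
The figures in this paragraph can be checked with a few lines of arithmetic. The variable names are illustrative; the training set size is the 50M examples given in Section 3.1, and the 39 hour overhead is the conservative estimate stated above.

```python
n_examples = 50_000_000      # splice training set size (Section 3.1)
batch_size = 100_000         # global mini-batch size b
passes = 20

# Number of mini-batch updates, each requiring one AllReduce call.
updates = passes * n_examples // batch_size

# 40h for the mini-batch run versus about 1h for L-BFGS.
overhead_seconds = (40 - 1) * 3600
seconds_per_update = overhead_seconds / updates
```

This gives 10,000 updates and roughly 14 seconds of communication overhead per update.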

We should point out that it is definitely possible that the
mini-batched SGD would reach similar accuracy with much
smaller mini-batch sizes (for 10k updates theory suggests
we should use mini-batches of size at most 10k); however,
the 39 hour communication overhead would remain. Using
larger mini-batches, we do expect that the time to reach 20
passes over data would be smaller (roughly proportional to
the number of mini-batch updates), but according to theory
(as well as our preliminary experiments on smaller subsets
of splice data), we would have inferior accuracy. Because of
the prohibitive running time, we were not able to tune and
evaluate this algorithm on the display advertising data set.

Figure 4: Test auPRC for 4 different learning strategies. Left: splice site recognition; right: display
advertising.

Parallel online learning. 

Finally, we compared our approach with the online parallel
learning algorithm of Hsu et al. (2011), using the same
online advertising dataset as in their paper. We note that this
is a substantially smaller dataset with about 10M examples
and 125G non-zero features in the data matrix. We did not
run this comparison on our larger datasets since the methods
in Hsu et al. (2011) do not scale well to a large number
of nodes, as evident from Figure 5 of their paper: with 8
nodes, the speed-up is only a factor of 2. For both algorithms,
we set the number of passes over the data to reach
a certain test error. This number turned out to be 18 for
the parallel online learning and 20 for our algorithm. The
running time using 8 nodes was 35 minutes for the parallel
online learning and 16 minutes for ours.

4. PROBLEMS WITH OTHER APPROACHES 
AND COMMUNICATION COST 

Here we discuss the limitations of existing approaches and 
systems. In many cases, it is helpful to compare the com- 
munication cost. Computational cost is also important in 
general, but it turns out to be non-distinguishing for the 
algorithms we consider here while communication cost anal- 
ysis aligns well with our empirical observations. Because 
modern switches are quite good at isolating communicating 
nodes, the most relevant communication cost is the max- 
imum (over nodes) of the communication cost of a single 
node. 

Several variables are important: 

1. m the number of nodes. 

2. n the number of examples. 



3. s the number of nonzero features per example. 

4. d the dimension of the parameters. 

5. T the number of passes over the examples. 

In the large-scale applications that are the subject of this paper,
we typically have $s \ll d \ll n$, where both $d$ and $n$ are
substantially large (see Section 3.1).

The way that data is dispersed across a cluster is relevant
in much of this discussion since an algorithm not using the
starting format must pay the communication cost of redistributing
the data. We assume the data is distributed across
the nodes uniformly according to an example partition, as
is common.



The per-node communication cost of the hybrid algorithm
is $\Theta(d\,T_{\mathrm{hybrid}})$, where $T_{\mathrm{hybrid}}$ is typically about 15 to maximize
test accuracy in our experiments. Note that the minimum
possible communication cost is $\Theta(d)$ if we save the
model on a single machine. There is no communication involved
in getting data to workers based on the data format
assumed above. An important point here is that every node
has a communication cost functionally smaller than the size
of the dataset, because there is no dependence on $ns$.
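
As a rough numerical illustration, the sketch below plugs the splice site figures from Section 3.1 into the hybrid cost and compares it with the number of non-zero entries in the data. It counts communicated values rather than bytes and ignores constant factors, both of which are simplifying assumptions.

```python
n = 50_000_000        # examples
s = 3_300             # non-zero features per example (approximate)
d = 11_725_480        # parameter dimension
T_hybrid = 15         # typical number of passes quoted in the text

dataset_entries = n * s          # non-zero entries in the data matrix
hybrid_values = d * T_hybrid     # values communicated per node, up to constants
ratio = dataset_entries / hybrid_values
```

Under these assumptions each node communicates on the order of 1.8e8 values, several hundred times fewer than the 1.65e11 non-zero entries of the dataset.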

Similar to our approach, Teo et al. (2007) propose a parallel
batch optimization algorithm (specifically, a bundle
method) using the MPI implementation of AllReduce. This
is a solid approach which arrives at an accurate solution
with $O(d\,T_{\mathrm{bundle}})$ communication per node. Our approach
improves over this in several respects. First, as Figure 4
demonstrates, we obtain a substantial boost thanks to our
warmstarting strategy, hence in practice we expect
$T_{\mathrm{bundle}} > T_{\mathrm{hybrid}}$. The second distinction is in the AllReduce
implementation. Our implementation is well aligned with Hadoop
and takes advantage of speculative execution to mitigate the
slow node problem. On the other hand, MPI assumes full
control over the cluster, which needs to be carefully aligned
with Hadoop's MapReduce scheduling decisions, and by itself,
MPI does not provide robustness to slow nodes.

Batch learning can also be implemented using MapReduce
on a Hadoop cluster (Chu et al., 2007), for example in the
Mahout project.^3 Elsewhere it has been noted that MapReduce
is not well suited to iterative machine learning algorithms
(Low et al., 2010; Zaharia et al., 2011). Evidence of
this is provided by the Mahout project itself, as their
implementation of logistic regression is not parallelized. Indeed,
we observe substantial speedups from a straightforward
substitution of AllReduce for MapReduce on Hadoop. It is also
notably easier to program with AllReduce, as code does not
require refactoring.

Figure 5: Test auPRC for different learning strategies as a function of the effective number of passes over
data. In L-BFGS, it corresponds to iterations of the optimization. In overcomplete SGD with averaging
(Zinkevich et al., 2010), it corresponds to the replication coefficient. Left: splice site recognition; right: display
advertising.

Table 6: Communication cost of various learning algorithms. Here n is the number of examples, s is the
number of nonzero features per example, d is the number of dimensions, T is the number of times the
algorithm examines each example, and b is the minibatch size (in minibatch algorithms).

Algorithm                                                                    Per-node communication cost
Bundle method (Teo et al., 2007)                                             $O(d\,T_{\mathrm{bundle}})$
Online with averaging (McDonald et al., 2010; Hall et al., 2010)             $O(d\,T_{\mathrm{online}})$
Parallel online (Hsu et al., 2011)                                           $O(ns/m + n\,T_{\mathrm{online}'})$
Overcomplete online with averaging (Zinkevich et al., 2010)                  $O(ns + d)$
Distrib. minibatch (dense) (Dekel et al., 2010; Agarwal and Duchi, 2011)     $O(d\,T_{\mathrm{mini}}\,n/b) = O(d\,T_{\mathrm{mini}}\sqrt{n})$
Distrib. minibatch (sparse) (Dekel et al., 2010; Agarwal and Duchi, 2011)    $O(b\,s\,T_{\mathrm{mini}}\,n/b) = O(n\,s\,T_{\mathrm{mini}})$
Hybrid online+batch                                                          $O(d\,T_{\mathrm{hybrid}})$

The remaining approaches are based on online convex optimization.
McDonald et al. (2010) and Hall et al. (2010)
study the approach when each node runs an online learning
algorithm on its examples and the results from the individual
nodes are averaged. This simple method is empirically
rather effective at creating a decent solution. The communication
cost is similar to our algorithm, $\Theta(d\,T_{\mathrm{online}})$, when
$T_{\mathrm{online}}$ passes are done. However, as we saw in Figure 4,
empirically $T_{\mathrm{online}} > T_{\mathrm{hybrid}}$. Also, we have observed that
non-uniform averaging approaches can provide a significant
performance boost (see Table 4).



Zinkevich et al. (2010) also carry out separate online optimization
on each node, followed by global averaging, but
they propose to use an overcomplete partition of the data
set. Our experiments show that this algorithm can have
competitive convergence (e.g., on display advertising data),
but on more difficult optimization problems it can be much
slower than the hybrid algorithm we use here (e.g., on splice
site recognition data). This approach also involves deep
replication of the data, for example having 1/4 of the examples
on each of 100 nodes. This is generally undesirable
with large datasets. The per-node communication cost is
$O(n s T_{\mathrm{rep}}/m + d)$, where $T_{\mathrm{rep}}$ is the level of replication and
$m$ is the number of nodes. Here, the first term comes from
the data transfer required for creating the overcomplete partition
and the second term from the parameter averaging.
Since $T_{\mathrm{rep}}/m$ is often a constant near 1 (0.25 was observed
by Zinkevich et al. (2010), and the theory predicts only a constant
factor improvement), this implies the communication
cost is $\Theta(ns)$, the size of the dataset.

^3 http://mahout.apache.org/

Other authors have looked into online mini-batch optimization
(Dekel et al., 2010; Agarwal and Duchi, 2011).
The key problem here is the communication cost. The per-node
communication cost is $\Theta(T_{\mathrm{mini}}\,d\,n/b)$, where $b$ is the
minibatch size (number of examples per minibatch summed
across all nodes), $T_{\mathrm{mini}}$ is the number of passes over the data,
$n/b$ is the number of minibatch updates per pass and $d$ is
the number of parameters. According to theory $b \le \sqrt{n}$,
implying communication costs of $\Theta(T_{\mathrm{mini}}\,d\,\sqrt{n})$. While for
small minibatch sizes $T_{\mathrm{mini}}$ can be quite small (plausibly
even smaller than 1), when $d$ is sufficiently large, this
communication cost is prohibitively large. In particular, if $T_{\mathrm{mini}}$
is at least a constant, the final communication cost is greater
than distributing the entire dataset. This is the reason for
the slow performance of mini-batched optimization that we
observed in our experiments. Reworking these algorithms
with sparse parameter updates, the communication cost per
update becomes $bs$, yielding an overall communication cost
of $\Theta(T_{\mathrm{mini}}\,n\,s)$, which is still several multiples of the dataset
size. Empirically, it has also been noted that after optimizing
learning rate parameters, the optimal minibatch size is
often 1 (Hsu et al., 2011).
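
The mini-batch cost expressions can be evaluated numerically as a sanity check. The sketch below uses the splice site figures from Section 3.1, a single pass, and the largest integer minibatch size not exceeding the square root of n; it counts communicated values, ignores constants, and the function names are hypothetical.

```python
import math

def dense_minibatch_cost(n, d, b, passes=1):
    """Values sent per node when each of the n/b updates communicates a dense d-vector."""
    return passes * d * (n / b)

def sparse_minibatch_cost(n, s, passes=1):
    """Values sent per node when each update communicates b*s sparse entries."""
    return passes * n * s

n, s, d = 50_000_000, 3_300, 11_725_480
b = math.isqrt(n)                     # largest integer b with b <= sqrt(n)
dense = dense_minibatch_cost(n, d, b)
sparse = sparse_minibatch_cost(n, s)
hybrid = 15 * d                       # d times a typical 15 passes of the hybrid method
```

With these inputs a single mini-batch pass communicates several hundred times more values per node than the entire hybrid run, in both the dense and the sparse variant.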



Another category of algorithms is those which use online
learning with a feature based partition of examples (Hsu
et al., 2011). Several families of algorithms have been tested
in this setting including delayed updates, minibatch, second
order minibatch, independent learning, and backprop.
The per-node communication costs differ substantially here.
Typical communication costs are $\Theta(ns/m + n\,T_{\mathrm{online}'})$, where
the first term is due to shuffling from an example-based format,
and the second term is for the run of the actual algorithm.
This has a similar tractability to the algorithm
we consider here, particularly if the data is organized in a
feature partition eliminating the first term. However, the
programming is substantially more delicate and no experiments
of the scales we consider have been conducted.

5. CONCLUSION 

We have shown that a new architecture for parallel learn- 
ing based on a Hadoop-compatible implementation of AllRe- 
duce can yield a combination of excellent prediction and 
training time performance in an easy programming style. 
The hybrid algorithm we employ allows us to benefit from 
the rapid initial optimization of online algorithms and the 
high precision of batch algorithms where the last percent of 
performance really matters. 

The combination of these techniques enables the training 
of linear predictors on datasets of size unmatched in the 
literature. 